arXiv:1506.05860v3 [stat.ML] 18 May 2016 


Variational Gaussian Copula Inference 


Shaobo Han Xuejun Liao David B. Dunson^ Lawrence Carin 

Department of ECE, Department of Statistical Science^, Duke University, Durham, NC 27708, USA 


Abstract 

We utilize copulas to constitute a unified 
framework for constructing and optimizing 
variational proposals in hierarchical Bayesian 
models. For models with continuous and 
non-Gaussian hidden variables, we propose 
a semiparametric and automated variational 
Gaussian copula approach, in which the para¬ 
metric Gaussian copula family is able to pre¬ 
serve multivariate posterior dependence, and 
the nonparametric transformations based on 
Bernstein polynomials provide ample flexibil¬ 
ity in characterizing the univariate marginal 
posteriors. 


1 Introduction 


A crucial component of Bayesian inference is approxi¬ 
mating the posterior distribution, which represents the 
current state of knowledge about the latent variables x 
after data y have been observed. When intractable in¬ 
tegrals are involved, variational inference methods find 
an approximation q{x) to the posterior distribution 
p{x\y) by minimizing the Kullback-Leibler (KL) diver¬ 
gence KL{q{x)\\p{x\y)} = f q(x)loglq(x)/p(xly)] dx, 
providing a lower bound for the marginal likelihood. 


To make inference tract able, mean-held vari ational 
Bayes (MFVB) methods_JJordan_e^^yjl99^Wain- 
wright and Jordan7 T2008l l assume a(x) is factorized 
over a certain partition of the latent variables x = 
[xi,...,xj], qvB{x) = l\jqvB{xj), with marginal 
densities in free-form and correlations be¬ 

tween partitions neglected. The s t ructur ed mean- 
field apOToaches_jlSau^^n^JordanL 199^ Hoffman 
and Blei. l2015f l preserve partial correlations and ap¬ 
ply only to models with readily identified substruc- 
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tures. The variational Gaussi an (VG) approxima¬ 
tion j|Bai;bei_and_^ishog|j_ 199^0pper and Archam- 
beau, 20091 ) allows incorporation of correlations by 
postulating a multivariate Gaussian parametric form 
qvaix) = A/’(/.i, S). The VG approximation, with con¬ 
tinuous margins of real variables, are not suitable for 
variables that are inherently positive or constrained, 
skewed, or heavy tai kd. For multi-modal poste riors, 
a mixture of MFVB (jjaakkola and Jordanl . 119981 ) or a 
mixture_of^niforml^;weighte^^aussians_( Gershman 
et ah, 2OI2I) may be employed, which usually requires a 
further lower bound on the average over the logarithm 
of the mixture distribution. 

To address the limitations of current variational meth¬ 
ods in failing to simultaneously characterize the pos¬ 
terior dependencies among latent variables while al¬ 
lowing skewness, multimodality, and other character¬ 
istics, we propose a new variational copula framework. 
Our approach decouples the overall inference task into 
two subtasks: (i) inference of the copula function, 
which captures the multivariate posterior dependen¬ 
cies; {ii) inference of a set of univariate margins, which 
are allowed to take essentially any form. Motivated 
by the work on automated (black - box) variational in¬ 
feren ce dRanganath et ah j_ 201T ■ Mnih and Gregor , 


20 14 Titsiasan^^azaro^Gredillaj201J: Nguyen and 
Bonilla, 20l4 ~ngma and Wellind . l2014[ ). we present 
a stochastic optimization algorithm for generic hier¬ 
archical Bayesian models with continuous variables, 
which (i) requires minimal model-specihc derivations, 
(ii) reproduces peculiarities of the true marginal pos¬ 
teriors, and (in) identifies interpretable dependency 
structure among latent variables. 

Using copulas to improve approximate Bayesian in¬ 
ference is a natural idea t hat has also b een explored 
recently in£ther_contextsJId_et,jjJ^ 201^Ferkingstad 
and j^e^ 2015|}^,^nde2endeny^^o^£u^^or^Tran 


et al. (120151 ) presented a copula augmented variational 
method with fixed-form marginals, and utilizes regu¬ 
lar vines to decompose the multivariate dependency 
structure into bivariate copulas and a nest of trees. 
Our method provides complementary perspectives on 
nonparametric treatment of univariate marginals. 
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Variational Copula Inference 
Framework 


Sklar’s theorem ( Sklail . 19591) ensures that any multi¬ 
variate joint distribution Q can be written in terms of 
univariate marginal distributions Fj{x) = P{Xj < x), 
j = 1,... ,p and a copula which describes the depen¬ 
dence structures between variables, such that 

Q{xi,...,Xp) =C[Fi{xi),...,Fp{xp)]. (1) 

Conversely, if C is a copula and {-Fj}j=i:p are dis¬ 
tribution functions, then the function Q defined by 
0 is a p-dimensional joint distribution function with 
marginal distributions Fi , F- 2 ,... , Fr,, o wing to the 
marginally closed property ( Sond . 2000f) . Assuming 
Q{xi, ...,Xp) has p-order partial derivatives, the joint 
probability density function (PDF) is q(xi,... ,Xp) = 
c&[Fi{xi),...,Fp{xp)]lf^^ifj{xj), where fj{xj) is 
the PDF of the jth variable and it is related to the 
corresponding cumulative distribution function (CDF) 
by Fj{xj) = fj(t)dt, c© is the copula density with 
parameter ©. 

Sklar’s theorem allows separation of the marginal 
distributions Fj{xj) from the dependence structure, 
which is appropriately expressed in the copula func¬ 
tion C. As a modeling tool, the specihed copula func¬ 
tion and mar gins can be directly fitted to the ob¬ 
served data V (Liu et al.l. 20091 : Wanthier and Jordan . 
201(ll : Lopez-Paz et alF 201,ll ) with their aarameters 
optimized v ia Bayesian or maximum likelihood esti¬ 
mators Isee ISmi^ ( 2013 1 and the references therein). 
In contrast, our goal is to use a copula as an infer¬ 
ence engine for full posterior approximation. All the 
unknowns (variables/parameters) in the user-specihed 
hierarchical model are encapsulated into a vector a;, 
and the optimal variational approximation qvci^) to 
the true posterior p{x\y) is found under the Sklar’s 
representation. This approach provides users with 
full modeling freedom and does not require condi¬ 
tional conjugacy between latent variables; thus the ap¬ 
proach is applicable to general models. Within some 
tractable copula family C G C, and assuming F{-) 
and C'(-) to be differentiable, we construct the vari¬ 
ational proposal as qc{x) = fj{xj), where 

u = F{x) = [Fi (xi),..., Fp(xp)], such that the ap¬ 
proximation satisfies 

q^{x) = aTgmmKL{qc{x)\\p{x\y)} 

qci^) 

= argminKL{qc(a^)lb(a:)} - Eg^(^^){lnp{y\x)], 

Qci^) 

where p{y\x) is the likelihood and p{x) is the prior. 
Letting the true posterior p{x\y) in Sklar’s repre¬ 
sentation be p{x\y) = c*{v)Y\j f*{xj), where v = 


[Ff{xi),.. .,F*{xp)], c*{v) and {f*{xj)}j=i..p are the 
true underlying copula density and marginal posterior 
densities, respectively, the KL divergence decomposes 
into additive terms (derivations are provided in Sup¬ 
plementary Material), 

KL{qc{x)\\p{x\y)} = KL{c[F’(a;)]||c*[F’*(a;)]} 

+ ^^KL{/,(x,)||/;(a:,)}. ( 2 ) 

Classical methods, such as MFVB and the VG approx¬ 
imation are special cases of the proposed VC inference 
framework. We next compare their KL divergence un¬ 
der Sklar’s representation and offer a reinterpretation 
of them under the proposed framework. 

2.1 Special Case 1: Mean-field VB 

The mean-field proposal corresponds to the inde¬ 
pendence copula Cuiu) = with free-form 

marginal densities fjixj). Given cu{u) = 1 we have 
qnix) = cn{u)ll^ fj(xj) = Uj = dYsix). If 

MFVB is not fully factorized, i.e. J < p, the indepen¬ 
dence copula is the only copula satisfying the marginal 
close d property, a ccording to the impossibility theo¬ 
rem ( Nelsenl . 20071 ). MFVB assumes an independence 
copula and only optimizes the free-form margins, 

KL{qvBix)\\p{x\y)} = KL{cn[T’(a;)]||c*[F*(a;)]} 

+ ^^KL{/,(a:,)||/;(a:,)}. (3) 

The lowest achievable KL divergence in MFVB 
is KL{qYB{x)\\p(x\y)} = KL{cn[T’(a;)]||c*(F’(a;))}, 
which is achieved when the true posterior marginals 
are found, i.e. Fj = Ff,\/j , in which case the overall 
KL divergence is reduced to the KL divergence be¬ 
tween the independence copula and the true copula. 
As is shown in (j^, the objective function contains two 
terms, both involving marginal CDFs {Fj}j=i.,p. Since 
in general c* 7 ^ cn, the optimal F minimizing the hrst 
term will not be equal to F*. Therefore, minimizing 
© will not lead to the correct marginals and this par¬ 
tially explains the reason why MFVB usually cannot 
find the true marginal posteriors in prac tice (e.g., vari¬ 


ances can be severely underestimated (jNeville et al 


2ni4lD . even though it allows for free-form margins. 


2.2 Special Case 2: VG Approximation 


In hxed-form variational Bayes (jHonkela et al.l . [20l3) , 
such as VG approximation, the multivariate Gaus¬ 
sian proposal qYQ{x) = J\f{x; /x, S) can be written as 
qvaix) = CG{u\T)]Y.^.^(l)j{xj-,pj,a^). VG not only 
assu mes the tru e copula function is a Gaussian cop¬ 
ula (Song, 200tl ) with parameter T = , 
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D = diag(S), but is also restricted to univariate Gaus¬ 
sian marginal densities {4)j{xj] 

KL{gvG(a;)lb(a;|y)} = KL{cG[5>(a:)]||c*[F*(a;)]} 

+ ^^.KL{<^,(x,)||/;(x,)}. (4) 

We can see in 0 that if the margins are misspec- 
ified, even if the true underlying copula is a Gaus¬ 
sian copula, CG = c*, there could still be a discrep¬ 
ancy V ■ KL|(/)i(xi)|| between margins, and 

KL{cG[$(a^)]||c*[F*(a:)j} is not zero. 

Concerning analytical tractability and simplicity, in 
the sequel we concentrate on variational Gaussian 
copula (VGC) proposals constructed via Gaussian 
copula with continuous margins, i.e. qwGc{x) = 
cg(m|T j), where u = , Fp{xp)]. 

Our VGC method extends MFVB and VG, and im¬ 
proves upon both by allowing simultaneous updates of 
the Gaussian copula parameter T and the adaptation 
of marginal densities {fjixj)}j-i.p. First, the univari¬ 
ate margins in VGC is not restricted to be Gaussian. 
Second, the Gaussian copula in VGC is more resistant 
to local optima than the independence copula assumed 
in MFVB and alleviates its variance underestimation 
pitfall, as is demonstrated in Section 15751 


3 Variational Gaussian Copula 
Approximation 


A Gaussian copula function with p x p correla¬ 
tion matrix T is defined as C'g(wi, ..., Mp|T) = 
$p($“i(Mi),...,$“i('Up)|Y) : [0,1]P [0,1] where 

$(•) is a shorthand notation of the CDF of A/’(0,1), 
and <f’p(-|Y) is the CDF of Np{0,T). The Gaussian 
copula density is 


Cg(ui, . . . ,Mp|T) 


—;^=exp 

V\r\ 



where 2 : = [$ ^(ui),...,$ ^{up)]’^. 

In the proposed VGC approximation, the variational 
proposal qvGc{x) is constructed as a product of Gaus¬ 
sian copula density and continuous marginal densities. 
The evidence lower bound (ELBO) of VGC approxi¬ 
mation is 


^G [gvGc(a:)] 



-f H[cGiu)] + H[fj{xj)], (5) 

i=i 


However, directly optimizing the ELBO in ([5]) w.r.t. 
the Gaussian copula parameter T and the univari¬ 
ate marginals {fjixj)}j=i:p often leads to a non-trivial 
variational calculus problem. For computational con¬ 
venience, we present several equivalent proposal con¬ 
structions based on Jacobian transformation and repa¬ 
rameterization. 


3.1 Equivalent Variational Proposals 


We incorporate auxiliary variables z by exploiting the 
latent variable representation of the Gaussian copula: 


= Pj 


-1 


(Mj), Uj = $(Zj), 


Np{ 0 ,r). Let¬ 


ting gj{-) = ($(•)) be bijective monotonic non¬ 

decreasing functions, Xj = gj(zj), \fj, the Jacobian 
transformation gives 


f r ^ 

gvGc(a^) = J n “ 9j(zj)) 


‘-1=1 


qG{z;0,T)dz 


= <?g( 5 ^(a;);0,T) 




'- 1=1 


where d{-) is the Dirac delta function. 

It is inconvenient to directly optimize the correlation 
matrix T of interest, since T is a positive semi-definite 
matrix with ones on the diagonal and off-diagonal el¬ 
ements between [—1,1]. We adopt the parameter ex¬ 
pansi on (PX) technique ( Liu et al] . ll998l : lLiu and 
1999l l. which has been applied in a ccelerating varia¬ 
tional Bayes (IQi and J aakkola, 2006) a nd the sampling 
of correlation matrix ( Talhouk et al.l . [ 2 OI 2 I I. Further 


considering zj = tj '^{zj) = pj -I- CjjZj, z ^ Np{p, S), 
S = DTD^, D = [diag(crjy)]j=i:p, thus Xj = g{zj) = 
g{t{zj)) := h{zj), where hj{-) = gj o tj{-) are also bi¬ 
jective monotonic non-decreasing functions, the varia¬ 
tional proposal is further written as 


/• r P _ ■ 

qVGG{x) = J S{Xj - hj{Zj)) 


‘- 1=1 


qGiz;p.,'E)dz 


= qG{h (a:);M,S) 


n 


^ ^-1 
dx, ^ ' 


Xj) 


Given the transformations {hj}j=i:p, gG(2;/.t,S) can 
be further repar ameterized by the Cholesk y decompo¬ 
sition S = CC‘^|ChalIis_and_Barbeijj 20lJ^Titsias and 
Lazaro-Gredilla7 l20I4ll . where C is a square lower tri¬ 
angular matrix. Table [1] summarizes four translatable 
representations of variational proposals. 


3.2 VGC with Fixed-form Margins 


where Uj = Fj{xj)^ F[[f{x)] = — J f(x) In f(x)dx. 


The ELBO under Sklar’s representation (O is there¬ 
fore translated into the Jacobian representation 
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Table 1: Equivalent Representations of Variational Gaussian Copula (VGC) Proposals 



Posterior Formulation 

Optimization Space 

RO 

Original 

Multivariate (non-Gaussian) density q(x) 

R1 

Sklar’s Representation 

Copula density CG(n|T) 

Univariate marginals {fj(Xj))j=i:p 

R2 

Jacobian Transform 

Gaussian density q{'z) = A/ {0,Y) 

Monotone functions { 9 j{zj)}j^i:p 

R3 

Parameter Expansion 

Gaussian density q{z) = Af{p., CC ^) 

Monotone functions {hj(zj)}j=i:p 


£c[gvGc(ic)] = EaA( 5 ;m,s)[^s( 2 :) - In90(2)], 

P 

is{z,h) = \np{y,h{z)) + ^\nh'j{zj). ( 6 ) 

i=i 

The monotonic transformations hj{-) = 
can be specified according to the desired parametric 
form of marginal posterior, if the inverse CDF F~^ is 
tractable. For example, the multivariate log-normal 
posterior can be constructed via a Gaussian copula 
with log-normal (LN) margins, 

p 

qvGC-LN{x) = C'g(m|T) ]^LN(a:j;/ij,cr|). (7) 

i=i 


where the [hj{hj^{xj))]~^ term is interpreted as 
a marginal-correction term. To guarantee analyt¬ 
ical tractability, we require h{-) to be (z) bijec- 
tive; (m) monotonic non-decreasing; (Hi) having un¬ 
bounded/constrained range; (iv) differentiable with 
respect to both its argument and parameters; and (v) 
sufficiently flexible. We propose a class of continuous 
and smooth transformations h(-) constructed via ker¬ 
nel mixtures that automatically have these desirable 
properties. 

4.1 Continuous Margins Constructed 
via Bernstein Polynomials 


This also corresponds to imposing exponential trans¬ 
form on Gaussian variables, x = h{z) = exp(2), 
z ^ A/"(/x, S). In this case, {f^j,crj}j=i:p controls the 
location and dispersion of the marginal density; h(-) 
does not have any additional parameters to control the 
shape and In h'{zj) = Zj takes a simple form. VGC-LN 
is further discussed in Section 15^ and Section lOl 

Given the copula function (7, we only need to find 
p one-dimensional margins. However, without know¬ 
ing characteristics of the latent variables, specifying 
appropriate parametric form for margins is a difficult 
task in general cases. First, the marginals might ex¬ 
hibit multi-modality, high skewness or kurtosis, which 
are troublesome for particular parametric marginals to 
capture. Second, a tractable inverse GDF with opti- 
mizable arguments/parameters, as required here, are 
available only in a handful of cases. Instead of using 
some arbitrary parametric form, we construct bijective 
transform functions via kernel mixtures, which lead to 
highly flexible (ideally free-form) marginal proposals. 


4 Bernstein Polynomials based 
Monotone Transformations 


The marginal densities in VGC can be recovered 
through Jacobian transformation. 


fjixj) = qoihj — ^{xj) 


= qoih: ^{xj)]Pj,al) 


h'Ah- (xj)) 


, ( 8 ) 


The Bernstein polynomials (BPs) have a uniform con¬ 
vergence property for continuous functions on unit in¬ 
terval [ 0 , 1 ] and have been used for nonparametric den¬ 
sity estimation ( Petrond . Il999l l . It seems more natural 
to use kernel mixtures directly as the variational pro¬ 
posal. However, the difficulty lies in tackling the term 
/(F“^(-)) involving the inverse CDF of mixtures (not 
analytical) and the need of a further lower bound on 
the entropy of mixtures. In this paper, we overcome 
this issue by using a sandwich-type construction of the 
transform li( 2)0 which maps from (— 00 , 00 ) to some 
target range building upon BP, 


k 

B{u;k,uj) = '^ujr,klu{r,k-r + l), (9) 

r—1 


where Iu(r^ k — r+1) is the regularized incomplete beta 
function. $(•) is the standard normal CDF mapping 
from (— 00 , 00 ) to [ 0 , 1 ], and ^'“^(•) is some predefined 
tractable inverse CDF with fixed parameters; for ex¬ 
ample, the inverse CDF of the exponential distribution 
helps map from [ 0 , 1 ] to ( 0 , 00 ) for positive variables. 
B{u]k,u3) relocates the probability mass on the unit 
interval [0,1]. The degree k is an unknown smooth¬ 
ing parameter, and is the unknown mixture weights 
on the probability simplex = {(wi,... ,u!k) ■ tUi > 
— !}• The proposed sandwich-type transfor¬ 
mation avoids the difficulty of specifying any partic¬ 
ular types of marginals, while still leads to tractable 
derivations presented in Section [5j 

^The index j on 'z is temporarily omitted for simplicity, 
and is added back when necessary. 
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4.2 Variational Inverse Transform 


5.1 Coordinate transformations 


Considering a 1-d variational approximation problem 
{x is a scalar, the true posterior f{x) is known up 
to the normalizing constant), fix q(T) = ^(0,1), thus 
u = ^ W[0,1], we can learn the monotonic trans¬ 
formation ^(•) = on the base uniform distribu¬ 

tion qo{u) by solving a variational problem, 

C(-) = argminKL{g(a;)||/(a:)}, x = ^(u) = 


Applying the coordinate transformationtH of stochas¬ 
tic updates, z = u + Ce, e ^ Af(0,I), introduced 
in dRezende et aP . 12014 : Titsias and Lazaro-Gredilla . 


2014r i. the gradient of the ELBO w.r.t. variational pa¬ 


rameter (/X, C) can be written as 


= EqG(2) [Vz4(2:,h) - V^lnqciz)] , 

Vc£c = E,o( 2) [VH(4(2,h) - VHlngG(2))e^] , (10) 


i.e., if we generate u ^ W[0,1], then x = ^*{u) ^ Q*. 
Q* is closest to the true distribution F with the min¬ 
imum KL divergence. This can be interpreted as 
the variat ional counterpa rt of the inverse transform 
sampling I Devrovel . Il986l l. termed as variational in¬ 
verse transform (VIT). Our BP-based construction 
^(■) = Q~^(-) = k,uj)) is one appropriate 

parameterization scheme for the inverse probability 
transformation (5“^(-)- VIT-BP offers two clear ad¬ 
vantages. First, as opposed to fixed-form variational 
Bayes, it does not require any specification of para¬ 
metric form for q(x). Second, the difficult task of 
calculating the general inverse CDFs Q”^)-) is less¬ 
ened to the much easier task of calculating the prede¬ 
fined tractable inverse CDF 4“^(-). Some choices of 
'!'(•) include CDF of A/’(0,1) for variables in (—oo, oo), 
Beta(2, 2) for truncated variables in (0,1). 


To be consistent with VIT, we shall set $(■) in © to 
be cr^), instead of <I>(-|0,1), such that u is al¬ 

ways uniformly distributed. Ideally, BP itself suffices 
to represent arbitrary continuous distribution function 
on the unit interval. However, it might require a higher 
order k. As is demonstrated in Section o this re¬ 
quirement can be alleviated by incorporating auxiliary 
parameters {/r, a^} in VGC-BP, which potentially help 
in changing location and dispersion of the probability 
mass. 


5 Stochastic VGC 

The derivations of deterministic VGC updates are 
highly model-dependent. First, due to the cross terms 
often involved in the log likelihood/prior, the corre¬ 
sponding Gaussian expectations and their derivatives 
may not be analytically tractable. Second, owing to 
the non-convex nature of many problems, only lo¬ 
cally optimal solutions can be guaranteed. In con¬ 
trast, stochastic implementation of VGC only requires 
the evaluation of the log-likelihood and log-prior along 
with their derivatives, eliminating most model-specific 
derivations, and it provides a chance of escaping local 
optima by introducing randomness in gradients. 


where the stochastic gradient terms 


V^~4(2) = ViT lnp(y, h{z)) -k V^~ lnh)(zj) 
d\Rp{y,x) 


dxi 


Inh'Azj ). 


According to the chain rule, the first derivative of h{-) 
w.r.t z is, 

^ d^-^[BmA);k,u;)] dH($(?);fc,a;) d$(g) 

dB{^(z)]k,LS) d^{z) dz 

_ b{<^{T)-,k,uj)(l){T) 

^ ^ 

where b{u]k,uj) = J2r=i^r,kl3{u-,r,k - r -k I), 
P{x; a, b) is the beta density /3(x; a, b) = 
r(a-k &)/(r(a)r(6))a;““^(I — x)^~^. Therefore, 
\nh'{z) = Infc, w) -k \n((){z) — ln'0(/i(z)) and 
Inh'(2j) = hj{zj)/h'j{zj) all take analytical 

expressions, where 


h'jizj) = [p[iZj)p2{Zj)p3iZj) + pi{Zj)p'^{Zj)p3iZj) 
- Pi {Aj)p2 (Zj)p3 (Zj)]/ [p3 (Zj)]^, 


where pi{zj) = p 2 {zj) = (l){zj), p 3 {zj) = 

iP{hj{zj)), P'i{zj)^= fc-r-kl), 

P2(Zj) = -Zj(t){Zj), p'3{Zj) = lj}'{hj{Zj))h'^{Zj), Uj = 

^{zj), </>(•) is the PDF of jV’(0,1), ■!/'(•) and ^'(•) are the 
predefined PDF and its derivative respectively. Defin¬ 
ing /I(x; a, 0) = I3{x] 0, b) = 0, the derivative is written 
as a combination of two polynomials of lower degree 
P'{x; a, &) = (a -k & — I)[/3(a:; a — 1, 6) — (3{x] a,b — I)]. 


In stochastic optimization, the gradients expressed in 
terms of expectations are approximated using Monte 
Carlo integration with finite samples. The gradients 
contain expectations on additive terms. Note that 
Rezende et al. ( 20I4[) and iTitsias and Lazaro-Gredillal 


(j2ni4l ) ignore the stochasticity in the entropy term 


^If necessary, the Gaussian copula can be replaced with 
other appropriate parametric forms. The coordinate trans¬ 
formation supports many other distributions as well, for 
example, those described in Appendix C.2. of Rezende et 
al. (2014). 
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Algorithm 1 (VGC-BP) Stochastic Variational Gaussian Copula Inference with Bernstein Polynomials 

Input: observed data y, user specified model \np{y,x) and first-order derivatives Va; lnp(y, a;), Bernstein 
polynomials degree fc, predefined ^'(•) and $(■) 

Initialize variational parameter ©o = Cq, t = 0. 

repeat 

t = t 1, 

Sample e ^ qc (e, 0, Ip), and set z = -I- Ct-ie, 

Pj = -I- Xt[yz-^s{z,h) — V2lngG(2)], % Update Ht-i with stepsize A* 

Ct = Ct-i + r]t['^z(-s{z,h) — V2lngG(5)]e^, % Update Ct-i with stepsize rjt 

for j = 1 to p do 

Vjj(3)£s(2 , h)), % Update with stepsize and gradient projection V 

end for 

until convergence criterion is satisfied 

Output: marginal parameters (^{u}^^'^}j=i,p, and copula parameters T 


E 9 g( 5 )[“ 11190 ( 2 :)] and assume V^]E,^( 5 )[-lngG( 2 :)] = 
0 and VcEgG( 5 )[-lii 9 G( 2 )] = diag[l/C'jy]j=i:p. This 
creates an inconsistency as we only take finite samples 
in approximating £^^( 2 )[V 5 £s( 2 )], and perhaps sur¬ 
prisingly, this also results in an increase of the gradient 
variance and the sensitivity to the learning rates. Our 
method is inherently more stable, as the difference be¬ 
tween the gradients, Vz[£s{h{z)) — qG{z)], Vz, tends to 
zero when the convergent point is approached. In con¬ 
trast, the gradients in previous method diffuses with 
a constant variance even around the global maximum. 
This phenomenon is illustrated in Section [6.21 


The alternative log derivative approach are also appli- 
cable to VGC infer e nce and other types of co pulas, see 


Paisley et a 


]j 201^ MnihandGrego^ 20lJ^Rezende 


et al. (j2ni4l ) for references. We leave this exploration 


open for future investigation. 


5.2 Update the BP Weights 


Under a given computational budget, we prefer a 
higher degree k, as there is no over-fitting issue in this 
variational density approximation task. Given k, the 
basis functions are completely known, depending only 
on index r. The only parameter left to be optimized 
in the Bernstein polynomials is the mixture weights. 
Therefore, this construction i s relatively simpler than 
Gaussian mixture proposa ls ( Gershman et al. 1 , 120121 : 
Nguven and Bonilla! . 1201411 . Assuming permissibility 
of interchange of integration and differentiation holds, 
we have V^o)Uc = EgcU) N^(o)^s(z,h,y)], with the 
stochastic gradients 


= V^o) lnp{y,h{z)) -f V^o) Inh'Jzj) 


d\np{y,x) 


dxj 


dhj{zj) 


- dw. 


U) 

r,k 


- r—l:k 


dlnh'Azj) 


duj, 


U) 

r,k 


. r=l:k 


where 

dhj{zj) _ d'S~^[B{uj;k,u}^^'>)] _ (r, fc - r-f 1) 


d\nh'j{zj)/= I3{uj-,r, k — r + l)/ 6 (uj; fc, 
'>P'ihj{zj)) 




Iuj(r, k-r + l). 


The gradients w.r.t turn into expectation straight¬ 
forwardly, to enable stochastic optimization of the 
ELBO. To satisfy the constraints of on the prob¬ 
ability simplex, we app ly the gradient pro jection op¬ 
eration V introduced in iDuchi et al.l (|2008l ) with com¬ 
plexity O(fclogfc). The above derivations related to 
BPs together with those in Section 15.11 are all ana¬ 
lytic and model-independent. The only two model- 
specific terms are \D.p{y,x) and d\n-p{y,x)/dx. The 
stochastic optimization algorithm is summarized in Al¬ 
gorithm [TJ with little computational overhead added 
relative to stochastic VG. The stability and efficiency 
of the stochastic optimization algorithm can be further 
improved_b^_embedding_^da£tive_subroutines_(Duchi 


et al.. l 2 nil|i and consideri ng second-order optimization 
method ( ihn et all . l2015l ). 


6 Experiments 

We use Gaussian copulas with fixed/free-form mar¬ 
gins as automated inference engines for posterior ap¬ 
proximation in generic hierarchical Bayesian mod¬ 
els. We evaluate the peculiarities reproduced in 
the univariate margins and the posterior dependence 
captured broadly across latent variables. This is 
done by comparing VGG methods to the ground 
truth and other baseline methods such as MGMG, 















































Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin 


MFVB, and VG (see Supplementary Material for de¬ 
tailed derivations). Matlab code for VGC is available 
from the GitHub repository: https: //github. com/ 
shaobohan/VariationalGaussianCopula 


6.1 Flexible Margins 

We first assess the marginal approximation accuracy of 
our BP-based constructions in Section i.e., h{-) = 
fc, u))) via 1-d variational optimization, 
where z ^ A/'(0,1) in VIT-BP, and 'z ^ in 

VGC-BP. For fixed BP order k, the shape of q{x) is 
adjusted solely by updating a;, according to the vari¬ 
ational rule. In VGG-BP, the additional marginal pa¬ 
rameters {/i,(T^} also contribute in changing location 
and dispersion of q{x). Examining Figure [TJ VGG- 
BP produces more accurate densities than VIT-BP un¬ 
der the same order k. Hereafter, the predefined !'(•) 
for real variables, positive real variable, and truncated 
[0,1] variables are chosen to be the CDF of A/’(0,1), 
Exp(l) and Beta(2,2), respectively. 





^—Ground Truth 
-^VIT-BP(k=5) 
- VGC-BP(k=5) 


—^Predefined 4' 



(d) Beta(0.5, 0.5) 


Figure 1: Marginal Adaptation: VIT-BP v.s. VGC-BP 


6.2 Bivariate Log-Normal 

The bivariate_Jogmm;mnJ_^DF_^(ij^^ 2 ]_l^itchison 
and Brown niObTl l is given bv 

p{xi,X 2 ) = exp (-C/2)/[27ra:ia:2cricr2\/l - p'^], 

a\{xi) — 2pai{xi)a2{x2) + a\{x2) 

where ai{xi) = (Incci — pLi)lai, i = 1, 2, —1 < p < 1. 

We construct a bivariate Gaussian copula with (z) Log¬ 
normal margins (VGC-LN) and (n) BP-based margins 
(VGC-BP). We set = P 2 = 0.1 and cti = tT 2 = 0.5, 
p = 0.4 or —0.4 (first and second row in EigureO. 
Both VGC-LN and VGC-BP methods presume the 



correct form of the underlying copula (bivariate Gaus¬ 
sian) and learn the copula parameters p. VGC-LN 
further assumes exactly the true form of the univari¬ 
ate margins (log-normal) while VGC-BP is without 
any particular assumptions on parametric form of mar¬ 
gins. Eigure [D shows that VGC-BP find as accurate 
joint posteriors as VGC-LN, even though the former 
assumes less knowledge about the true margins. 



Figure 2: Approximate Posteriors via VGC methods 


In updating (/ x, C), VGC-LNl and VGC-B Pl follow 
the scheme in ([Titsias and Lazaro-Gredillal . l2014f) and 
neglect the stochasticity in the entropy term; while 
VGC-LN2 and VGC-BP2 are based on our scheme in 
(nni). Under the same learning rates, we define the rel¬ 
ative mean square error (RMSE) of the copula param¬ 


eter as R{p) = ; both VGC-LN and VGC-BP 

results in Eigure|3| consistently show that our method 
leads to less noisy gradients and converges faster. 





Figure 3: RMSE(p) of VGC-LN and VGC-BP v.s. Itera¬ 
tions; Left two: p = 0.4; Right two: p = —0.4 


6.3 Horseshoe Shrinkage 


The horseshoe distribution ([Carvalho et ah . 201Cll l 
can b e represented i n equ ivalent conjugate hierar¬ 
chies ( Neville et al.l . l20I4ll y\T ~ A/’(0,t), r|A 
InvGa(0.5, A), A ~ InvGa(0.5,1). Here we assume 
y = O.OI is the (single) observation. Denoting x = 
{xi,X 2 ) = (t, 7 = I/A), we implemented the VGC- 
BP algorithm (fc = 10) and VGC-LN algorithms (de¬ 
terministic implementationt0 are available in this spe¬ 
cial case). We compared them with two baselines: (z) 
Gibbs sampler (1x10® samples), and (zz) MFVB. From 
Figure m it is noted that the VGC methods with full 
correlation matrix (VGC-LN-full, VGC-BP-full) are 
able to preserve the posterior dependence and alleviate 


®For gradien t updates , we u se a quasi-Newton strategy 
implemented in ISchmidtl ll2012f) . 
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Methods 

ELBO 

MFVB 

-1.0778 

VGC-LN-full 

-0.0634 

VGC-LN-diag 

-1.2399 

VGC-BP 

0.3277 


Figure 4: (Left Panel) Approximated Posteriors (Shown in 
Log Space for Visualization Purpose); (Right Panel) com¬ 
parison of ELBO of different variational methods 

the under-estimation of the posterior variance. VGC- 
LN-full lead to higher ELBO than MFVB, and the 
gain is lost with factorized assumption T = / (VGC- 
LN-diag) in which case the Gaussian copula reduces to 
the independence copula. The restriction of paramet¬ 
ric margins is relaxed in VGG-BP. With refinement of 
the mixture weights, VGG-BP leads to higher ELBO 
than VGG-LN. Since the Gaussian copula admits nei¬ 
ther lower nor upper tail dependence, the posterior 
dependence it is able to preserve can be restrictive. It 
is a future research topic to explore other copula fam¬ 
ilies that allow more complex posterior dependencies 
in variational copula inference. 


6.4 Poisson Log-Linear Regression 



Figure 5: Univariate Margins and Pairwise Posteriors 


QvGci^) is independent and faster, as compared to 
MGMG. The obtained posterior approximation could 
possibly improve the efficiency of Metropolis-Bastings 
(MB) samplers by r eplacing the MGMG prerun as a 
reasonable proposal ( Schmidl et al.l . 20l3). 


The proposed method is an automated approach of 
approximating full posteriors. It is readily applica¬ 
ble to a broad scope of latent Gaussian models with 
non-conjugate likelihoods. Compared with the inte- 
grat ed nested Lanlace approximation (INLA) ( Rue 
et ah, 1200^ and integrated non-factorized variational 
inference ( Ban et ah . 20131) . our approach does not 
need to discretize the space for non-Gaussian variables 
and thus does not suffer from the limits on the number 
of hyperparameters. 


We consider the t ropical rain forest dataset ( Moller 
and Waagepetersen, 200?! ). a point pattern giving the 
locations of 3605 trees accompanied by covariate data 
giving the elevation. Resampling the data into a grid 
of 50 X 50m {ui locates the i-th grid), the number of 
trees yi per unit area is modeled as, yi ^ Poisson(^i), 

i = 1,..., n, \og{yi) = /3o + PiUi + ^ 2 ^^ /3o ~ ^(0, r), 
/3i ~ V(0,t), (32 A^(0,r), r ~ Ga(l,l). We de¬ 

note X = (/3o,/3i,/32, t), and choosing 'I'~^(-) to be the 
CDF of A/’(0,1) or Exp(l) accordingly. The implemen¬ 
tation of VGC-BP leads to highly accurate marginal 
and pairwise posteriors (See Figure [5]), as compared 
to the MGMG sampler (1 x 10® runs) implemented in 
JAGS0 as reference solutions. 


Interestingly, for non-conjugate models with unknown 
exact joint posteriors, VGC still provides a Sklar’s rep¬ 
resentation of the approximated posterior, including 
an analytical Gaussian copula, and a number of uni¬ 
variate margins (summarized as univariate histograms 
if not in closed-form). For further uses such as cal¬ 
culating sample quantiles, simulating samples from 

^http://mcmc-jags.sourceforge.net/ 


7 Discussions 


This article proposes a unified variational copula infer¬ 
ence framework. In VGC, we have focused on Gaus¬ 
sian copula family for simplicity, however, other more 
flexible forms such as Gaussian mixture copula can be 
considered as well. To avoid the difficulty of speci¬ 
fying marginals for hidden variables, a nonparametric 
procedure based on Bernstein polynomial s indirectly 
i nduce s hig hly flexible uniyar i ate m argins. iTran et al.l 
( 20151 ) and Kucukelbir et al.l ( 20151 ) could potentially 
benefit from our flexible margins, while our approach 
i s likely to b e nefit f rom the yine copula decomposition 


(|Tran et al.l . I2015I ) to allow richer or more complex 


dependencies an d the automatic d i fferen tiation tech¬ 
niques applied in iKucukelbir et al.l (|2015l) . 
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Supplementary Material 


B2: Student’s t Distribution 


A: KL Additive Decomposition 

Letting the variational proposal in Sklar’s repre¬ 
sentation be qvc{x) = fj{xj), and the 

true posterior be p{x\y) = c*{v)]^^ where 

u = F{x) = [Fi{xi),... ,Fp{xp)], V = F*{x) = 
, F*{xp)]. The KL divergence decomposes 
into additive terms, 


1. lnp(a;) oc —{v+l)/2\n{\Fx'^/v) and 

d\np{x)/dx = —{u + l)x/{v-\-x'^), > 0 is 

the degrees of freedom 

2. is predefined as CDF of A/’(0,1) 


B3: Gamma Distribution 


p{x\y) 

4^{x)] rij fjjxj) 


KL{q{x)\\p{x\y)} ^ J q{x)(\og 

-I 


dx 


= / c[F(a:)] log 


cjFjx)] 

’c*[F*{x)] 




+ J c[F(a:)] fj{xj) (^log j" ( J.) ) IT 

The first term in (1121) 


( 12 ) 


= j c{u) (^og-^ - 


’c*(F*(F-i(n))) 
= KL{c(n)||c*[F*(F-i(n))]}, 

The second term in (fT^ 


J 4F{x)]Ylfj{xj)(^og 


rij ^ -pT J 

n-wjn*. 


= 5Z / 4F'{x)]Y[fj{xj)(\og 




Y[dxj 


= log 


dXj (Marginal Closed Property) 


7;fe) 

J 

= ^KL{/d*dii/;7d}. 

j 

Therefore 

^H<lix)\\p{x\y)} = KL{c[F(a:)]||c*[F*(a;)]} 
-i;^KL{/da;,)||/*(a;,)} 


(13) 


B: Model-Specific Derivations 


1. lnp(a;) oc (a — 1) in a; — and 91np(a;)/9a; = 
(a — l)/x — P, a is the shape parameter, /3 is the 
rate parameter 

2. 'I'(a;) is predefined as CDF of Exp(l) 


B4: Beta Distribution 

1. lnp(a;) oc (a — l)lnx + {b — l)ln(l —a;) and 
d\np{x)/dx = (a — l)/x — (b — 1)/(1 — x), both 
a, b > 0 

2. ’l>(x) is predefined as CDF of Beta(2,2) 


B5: Bivariate Log-Normal 

1. lnp(a;i,a; 2 ) oc — lna;i — lna ;2 — C/2 and 


dln/(a;i,a;2) 

1 

ai(xi) - poi 2 {x 2 ) 

dxi 

Xl 

(1 - p2)a:icri 

dlnf(xi,X 2 ) 

1 

02(0:2) - pai(a;i) 

dX 2 

X 2 

(1 - p 4^202 


2. 4'(x) is predefined as CDF of Exp(l) 


C. Derivations in Horseshoe Shrinkage 

The equilvalent hierarchical model is 


3/|r ~ A/’(0, r), r|7InvGa(0.5,7), 7 ~ Ga(0.5,1) 


Bl: Skew Normal Distribution 


Cl: Gibbs Sampler 


1. Inp(x) oc ln//>(x) -f ln<i>(ax) and dlnp(x)/dx = The full conditional posterior distributions are 
—X -I- a(/>(ax)/<l>(ax), a is the shape parameter 

2. 'I'(a;) is predefined as CDF of Af(0, 1) 


p(t|j/, 7) = InvGa(l,i/V2+ 7 ) , p( 7 |t) = Ga (l, r ^ + l) 
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C2: Mean-field Variational Bayes 

The ELBO under MFVB is 


C4: Stochastic VGC-LN 

The stochastic part of the ELBO is, 


/iMFVB[(?VB('7-,7)] = E,j(^)q(^)[lnp(2/,r,7)] 

+ Hi[q{T;ai,Pi)] + Ll2[g(7; «2,/32)] 

where 


£s{z) = Co + 22 - 2l - 


y^exp(-2:i) 

2 


exp{z 2 — z\) — exp{z 2 ) 


and 


E 5 (r)q( 7 )[lnp(t/,T, 7 )] = -0.51n(27r) -21nr(0.5) -2{lnT) 
- (t"^) /2 - (7 > (t"^) - (7) 

Hi[q{T\ai,Pi)\ = ai + ln/3i + In [r(ai)] - (1 + ai)^/j(ai) 
712[?(7; a2,/32)] = 02 - ln/32 + In [r(a2)] + (1 - 02)V'(a2) 

The variational distribution 

q{T) =16 (T;ai,/3i) = IQ 12 + (7)) , 

5(7) = g{r, 02, P2) = e (7; 1, + 1) 

where 

(Inr) = Indi - i/’(ai) = in (y^/2 + (7>) - '^(1), 

/_-l\ _ Oil _ 1 / \ _ *^2 _ 1 

(yV2 + (7))’ ^ ^ “ (r-i) + 1 

C3: Deterministic VGC-LN 


= -1 + + exp(F2 - £i) 

Vy24(5) = 1 - exp( 2'2 - zl) - exp( 2 i) 

C5: Stochastic VGC-BP 

1. lnp(y,xi,X 2 ) = Co - 21na:i - l{2x^) - X 2 IX 1 - X 2 , 

dlnp{y,xi,X 2 )/dxi = -2/xi +y^/{2xl) +X 2 lxi, 
dlnp{y,xi,X 2 )/dx 2 — —1/xi — 1 

2. 4 ^( 2 ;) is predefined as CDF of Exp(O.Ol). 

D. Derivations in Poisson Log Linear Re¬ 
gression 

For i = 1,..., n, the hierarchical model is 


Denoting x = (xi,a;2) = (r,7), we construct a vari¬ 
ational Gaussian copula proposal with (1) a bivari¬ 
ate Gaussian copula, and (2) fixed-form margin for 
both XI = T £ (0,00) and 2:2 = 7 € (0,00); we em¬ 
ploy fj{xj ; fj,j , cr|j) = CJ\f{xj; fij , cr |^-),Xj = hj{zj) = 
exp(zj) = g{zj) = exp{ajjZj + pj), j = 1,2. The 
ELBO of VGC-LN is 


tivGc(/ 2 , C) = Cl — pi +P 2 — 


exp (^-pi + 


— to - exp ( p 2 + 


C 21 + G 2 : 


-hlnlCI 


to = exp ^{p 2 - pi) + ■ 

where co = —0.5 In (27r) — 2 lnr(0.5), ci = co -I- In (27re). 
The gradients are 


yi Poisson(pi), log(/ii) = do + l3iXi + ^ 2 X 1 , 
do~iV(0,r), di"-N(0, r), /32~V(0, r), r--Ga(l, 1) 

The log likelihood and prior, 


lnp{y, l3,r) = Y^ lnp{yi\f3) + InMiPo', 0, r) + in A/'(/3i; 0, r) 

i=l 

-I- in A/’(d 2 ; 0, r) + In Ga(r; 1,1) 

where lnp{yi\f3) = yilnpi — pi — Inj/d, and pi = 
exp (do + di^i + P 2 x‘i). 

The derivatives are 


dInp(y,/3,-r) 

dPo 


n 

'^iVi - Mi) 

,i=l 


— r ^do 


dCvGcjp, C) 

dpi 

dCvGcip, C') 
dp2 

dTvGc(/2, C) 

dCii 

dCvGcip, C) 
dC2i 

dCvGcjp, C) 
dC22 


-1 + ^ exp ( ^ - pi ) -h to 


= 1 — to — exp p 2 + 


C‘ 


G 21 + G 2 : 


= ——Gii exp ( -Ml ) — (Gii — G 2 i)to -1- 

G 21 -I- G 22 


= (Gii — C 2 i)to — G 21 exp I p 2 -I- 


d\np{y,P, 

dpi 


d\np{y,P,T) 

dp2 


'^Xi{yi - Pi) 


^xfiyi - Pi) 


— T ^Pl 


- T ^P2 


= —022^0 — C 22 exp I P 2 + 


C 21 + C 22 


C 22 


d\np{y,(3,T) 3 do + di + dl , ao - 1 ^ 

- Wr - - +- 275 -+ 






























